Problem Statement¶
Business Context¶
Renewable energy sources play an increasingly important role in the global energy mix, as the effort to reduce the environmental impact of energy production increases.
Out of all the renewable energy alternatives, wind energy is one of the most developed technologies worldwide. The U.S Department of Energy has put together a guide to achieving operational efficiency using predictive maintenance practices.
Predictive maintenance uses sensor information and analysis methods to measure and predict degradation and future component capability. The idea behind predictive maintenance is that failure patterns are predictable and if component failure can be predicted accurately and the component is replaced before it fails, the costs of operation and maintenance will be much lower.
The sensors fitted across different machines involved in the process of energy generation collect data related to various environmental factors (temperature, humidity, wind speed, etc.) and additional features related to various parts of the wind turbine (gearbox, tower, blades, break, etc.).
Objective¶
“ReneWind” is a company working on improving the machinery/processes involved in the production of wind energy using machine learning and has collected data of generator failure of wind turbines using sensors. They have shared a ciphered version of the data, as the data collected through sensors is confidential (the type of data collected varies with companies). Data has 40 predictors, 20000 observations in the training set and 5000 in the test set.
The objective is to build various classification models, tune them, and find the best one that will help identify failures so that the generators could be repaired before failing/breaking to reduce the overall maintenance cost. The nature of predictions made by the classification model will translate as follows:
- True positives (TP) are failures correctly predicted by the model. These will result in repairing costs.
- False negatives (FN) are real failures where there is no detection by the model. These will result in replacement costs.
- False positives (FP) are detections where there is no failure. These will result in inspection costs.
It is given that the cost of repairing a generator is much less than the cost of replacing it, and the cost of inspection is less than the cost of repair.
“1” in the target variables should be considered as “failure” and “0” represents “No failure”.
Data Description¶
- The data provided is a transformed version of original data which was collected using sensors.
- Train.csv - To be used for training and tuning of models.
- Test.csv - To be used only for testing the performance of the final best model.
- Both the datasets consist of 40 predictor variables and 1 target variable
Importing necessary libraries¶
# Installing the libraries with the specified version.
!pip install pandas==1.5.3 numpy==1.25.2 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 imbalanced-learn==0.10.1 xgboost==2.0.3 threadpoolctl==3.3.0 -q --user
Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Libaries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# To tune model, get different metric scores, and split data
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
ConfusionMatrixDisplay,
)
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
# To impute missing values
from sklearn.impute import SimpleImputer
# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# To do hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV
# To be used for creating pipelines and personalizing them
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
# To supress scientific notations for a dataframe
pd.set_option("display.float_format", lambda x: "%.3f" % x)
# To help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
AdaBoostClassifier,
GradientBoostingClassifier,
RandomForestClassifier,
BaggingClassifier,
)
from xgboost import XGBClassifier
# To suppress scientific notations
pd.set_option("display.float_format", lambda x: "%.3f" % x)
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
Loading the dataset¶
# uncomment and run the following lines for Google Colab
# from google.colab import drive
# drive.mount('/content/drive')
# Code to let colab access my google drive
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
# Complete the code to read the training data
df_train = pd.read_csv('/content/drive/My Drive/Train.csv')
# Complete the code to read the test data
df_test = pd.read_csv('/content/drive/My Drive/Test.csv')
Data Overview¶
- Observations
- Sanity checks
Checking the shape of the dataset¶
# Checking the number of rows and columns in the training data
df_train.shape
(20000, 41)
# Checking the number of rows and columns in the test data
df_test.shape ## Complete the code to view dimensions of the test data
(5000, 41)
# Code to create a copy of the training data
data = df_train.copy()
# Code to create a copy of the test data
data_test = df_test.copy()
Observations
As indicated in the objective statement, the data has 40 predictors, 20000 observations in the training set and 5000 in the test set. Note that there is one target variable. Hence, the output returning 41.
Displaying the first and last five rows of the dataset for both training and testing.¶
# Code to view the first 5 rows of the training data
data.head() ## Complete the code to view top 5 rows of the training data
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | V29 | V30 | V31 | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -4.465 | -4.679 | 3.102 | 0.506 | -0.221 | -2.033 | -2.911 | 0.051 | -1.522 | 3.762 | -5.715 | 0.736 | 0.981 | 1.418 | -3.376 | -3.047 | 0.306 | 2.914 | 2.270 | 4.395 | -2.388 | 0.646 | -1.191 | 3.133 | 0.665 | -2.511 | -0.037 | 0.726 | -3.982 | -1.073 | 1.667 | 3.060 | -1.690 | 2.846 | 2.235 | 6.667 | 0.444 | -2.369 | 2.951 | -3.480 | 0 |
| 1 | 3.366 | 3.653 | 0.910 | -1.368 | 0.332 | 2.359 | 0.733 | -4.332 | 0.566 | -0.101 | 1.914 | -0.951 | -1.255 | -2.707 | 0.193 | -4.769 | -2.205 | 0.908 | 0.757 | -5.834 | -3.065 | 1.597 | -1.757 | 1.766 | -0.267 | 3.625 | 1.500 | -0.586 | 0.783 | -0.201 | 0.025 | -1.795 | 3.033 | -2.468 | 1.895 | -2.298 | -1.731 | 5.909 | -0.386 | 0.616 | 0 |
| 2 | -3.832 | -5.824 | 0.634 | -2.419 | -1.774 | 1.017 | -2.099 | -3.173 | -2.082 | 5.393 | -0.771 | 1.107 | 1.144 | 0.943 | -3.164 | -4.248 | -4.039 | 3.689 | 3.311 | 1.059 | -2.143 | 1.650 | -1.661 | 1.680 | -0.451 | -4.551 | 3.739 | 1.134 | -2.034 | 0.841 | -1.600 | -0.257 | 0.804 | 4.086 | 2.292 | 5.361 | 0.352 | 2.940 | 3.839 | -4.309 | 0 |
| 3 | 1.618 | 1.888 | 7.046 | -1.147 | 0.083 | -1.530 | 0.207 | -2.494 | 0.345 | 2.119 | -3.053 | 0.460 | 2.705 | -0.636 | -0.454 | -3.174 | -3.404 | -1.282 | 1.582 | -1.952 | -3.517 | -1.206 | -5.628 | -1.818 | 2.124 | 5.295 | 4.748 | -2.309 | -3.963 | -6.029 | 4.949 | -3.584 | -2.577 | 1.364 | 0.623 | 5.550 | -1.527 | 0.139 | 3.101 | -1.277 | 0 |
| 4 | -0.111 | 3.872 | -3.758 | -2.983 | 3.793 | 0.545 | 0.205 | 4.849 | -1.855 | -6.220 | 1.998 | 4.724 | 0.709 | -1.989 | -2.633 | 4.184 | 2.245 | 3.734 | -6.313 | -5.380 | -0.887 | 2.062 | 9.446 | 4.490 | -3.945 | 4.582 | -8.780 | -3.383 | 5.107 | 6.788 | 2.044 | 8.266 | 6.629 | -10.069 | 1.223 | -3.230 | 1.687 | -2.164 | -3.645 | 6.510 | 0 |
# Code to view the last 5 rows of the training data
data.tail() ## Complete the code to view last 5 rows of the training data
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | V29 | V30 | V31 | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19995 | -2.071 | -1.088 | -0.796 | -3.012 | -2.288 | 2.807 | 0.481 | 0.105 | -0.587 | -2.899 | 8.868 | 1.717 | 1.358 | -1.777 | 0.710 | 4.945 | -3.100 | -1.199 | -1.085 | -0.365 | 3.131 | -3.948 | -3.578 | -8.139 | -1.937 | -1.328 | -0.403 | -1.735 | 9.996 | 6.955 | -3.938 | -8.274 | 5.745 | 0.589 | -0.650 | -3.043 | 2.216 | 0.609 | 0.178 | 2.928 | 1 |
| 19996 | 2.890 | 2.483 | 5.644 | 0.937 | -1.381 | 0.412 | -1.593 | -5.762 | 2.150 | 0.272 | -2.095 | -1.526 | 0.072 | -3.540 | -2.762 | -10.632 | -0.495 | 1.720 | 3.872 | -1.210 | -8.222 | 2.121 | -5.492 | 1.452 | 1.450 | 3.685 | 1.077 | -0.384 | -0.839 | -0.748 | -1.089 | -4.159 | 1.181 | -0.742 | 5.369 | -0.693 | -1.669 | 3.660 | 0.820 | -1.987 | 0 |
| 19997 | -3.897 | -3.942 | -0.351 | -2.417 | 1.108 | -1.528 | -3.520 | 2.055 | -0.234 | -0.358 | -3.782 | 2.180 | 6.112 | 1.985 | -8.330 | -1.639 | -0.915 | 5.672 | -3.924 | 2.133 | -4.502 | 2.777 | 5.728 | 1.620 | -1.700 | -0.042 | -2.923 | -2.760 | -2.254 | 2.552 | 0.982 | 7.112 | 1.476 | -3.954 | 1.856 | 5.029 | 2.083 | -6.409 | 1.477 | -0.874 | 0 |
| 19998 | -3.187 | -10.052 | 5.696 | -4.370 | -5.355 | -1.873 | -3.947 | 0.679 | -2.389 | 5.457 | 1.583 | 3.571 | 9.227 | 2.554 | -7.039 | -0.994 | -9.665 | 1.155 | 3.877 | 3.524 | -7.015 | -0.132 | -3.446 | -4.801 | -0.876 | -3.812 | 5.422 | -3.732 | 0.609 | 5.256 | 1.915 | 0.403 | 3.164 | 3.752 | 8.530 | 8.451 | 0.204 | -7.130 | 4.249 | -6.112 | 0 |
| 19999 | -2.687 | 1.961 | 6.137 | 2.600 | 2.657 | -4.291 | -2.344 | 0.974 | -1.027 | 0.497 | -9.589 | 3.177 | 1.055 | -1.416 | -4.669 | -5.405 | 3.720 | 2.893 | 2.329 | 1.458 | -6.429 | 1.818 | 0.806 | 7.786 | 0.331 | 5.257 | -4.867 | -0.819 | -5.667 | -2.861 | 4.674 | 6.621 | -1.989 | -1.349 | 3.952 | 5.450 | -0.455 | -2.202 | 1.678 | -1.974 | 0 |
Observations
The training dataset has 20000 observations.
# Code to view the first 5 rows of the test data
data_test.head() ## Complete the code to view first 5 rows of the test data
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | V29 | V30 | V31 | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.613 | -3.820 | 2.202 | 1.300 | -1.185 | -4.496 | -1.836 | 4.723 | 1.206 | -0.342 | -5.123 | 1.017 | 4.819 | 3.269 | -2.984 | 1.387 | 2.032 | -0.512 | -1.023 | 7.339 | -2.242 | 0.155 | 2.054 | -2.772 | 1.851 | -1.789 | -0.277 | -1.255 | -3.833 | -1.505 | 1.587 | 2.291 | -5.411 | 0.870 | 0.574 | 4.157 | 1.428 | -10.511 | 0.455 | -1.448 | 0 |
| 1 | 0.390 | -0.512 | 0.527 | -2.577 | -1.017 | 2.235 | -0.441 | -4.406 | -0.333 | 1.967 | 1.797 | 0.410 | 0.638 | -1.390 | -1.883 | -5.018 | -3.827 | 2.418 | 1.762 | -3.242 | -3.193 | 1.857 | -1.708 | 0.633 | -0.588 | 0.084 | 3.014 | -0.182 | 0.224 | 0.865 | -1.782 | -2.475 | 2.494 | 0.315 | 2.059 | 0.684 | -0.485 | 5.128 | 1.721 | -1.488 | 0 |
| 2 | -0.875 | -0.641 | 4.084 | -1.590 | 0.526 | -1.958 | -0.695 | 1.347 | -1.732 | 0.466 | -4.928 | 3.565 | -0.449 | -0.656 | -0.167 | -1.630 | 2.292 | 2.396 | 0.601 | 1.794 | -2.120 | 0.482 | -0.841 | 1.790 | 1.874 | 0.364 | -0.169 | -0.484 | -2.119 | -2.157 | 2.907 | -1.319 | -2.997 | 0.460 | 0.620 | 5.632 | 1.324 | -1.752 | 1.808 | 1.676 | 0 |
| 3 | 0.238 | 1.459 | 4.015 | 2.534 | 1.197 | -3.117 | -0.924 | 0.269 | 1.322 | 0.702 | -5.578 | -0.851 | 2.591 | 0.767 | -2.391 | -2.342 | 0.572 | -0.934 | 0.509 | 1.211 | -3.260 | 0.105 | -0.659 | 1.498 | 1.100 | 4.143 | -0.248 | -1.137 | -5.356 | -4.546 | 3.809 | 3.518 | -3.074 | -0.284 | 0.955 | 3.029 | -1.367 | -3.412 | 0.906 | -2.451 | 0 |
| 4 | 5.828 | 2.768 | -1.235 | 2.809 | -1.642 | -1.407 | 0.569 | 0.965 | 1.918 | -2.775 | -0.530 | 1.375 | -0.651 | -1.679 | -0.379 | -4.443 | 3.894 | -0.608 | 2.945 | 0.367 | -5.789 | 4.598 | 4.450 | 3.225 | 0.397 | 0.248 | -2.362 | 1.079 | -0.473 | 2.243 | -3.591 | 1.774 | -1.502 | -2.227 | 4.777 | -6.560 | -0.806 | -0.276 | -3.858 | -0.538 | 0 |
# Code to view the last 5 rows of the test data
data_test.tail() ## Complete the code to view last 5 rows of the test data
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | V29 | V30 | V31 | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4995 | -5.120 | 1.635 | 1.251 | 4.036 | 3.291 | -2.932 | -1.329 | 1.754 | -2.985 | 1.249 | -6.878 | 3.715 | -2.512 | -1.395 | -2.554 | -2.197 | 4.772 | 2.403 | 3.792 | 0.487 | -2.028 | 1.778 | 3.668 | 11.375 | -1.977 | 2.252 | -7.319 | 1.907 | -3.734 | -0.012 | 2.120 | 9.979 | 0.063 | 0.217 | 3.036 | 2.109 | -0.557 | 1.939 | 0.513 | -2.694 | 0 |
| 4996 | -5.172 | 1.172 | 1.579 | 1.220 | 2.530 | -0.669 | -2.618 | -2.001 | 0.634 | -0.579 | -3.671 | 0.460 | 3.321 | -1.075 | -7.113 | -4.356 | -0.001 | 3.698 | -0.846 | -0.222 | -3.645 | 0.736 | 0.926 | 3.278 | -2.277 | 4.458 | -4.543 | -1.348 | -1.779 | 0.352 | -0.214 | 4.424 | 2.604 | -2.152 | 0.917 | 2.157 | 0.467 | 0.470 | 2.197 | -2.377 | 0 |
| 4997 | -1.114 | -0.404 | -1.765 | -5.879 | 3.572 | 3.711 | -2.483 | -0.308 | -0.922 | -2.999 | -0.112 | -1.977 | -1.623 | -0.945 | -2.735 | -0.813 | 0.610 | 8.149 | -9.199 | -3.872 | -0.296 | 1.468 | 2.884 | 2.792 | -1.136 | 1.198 | -4.342 | -2.869 | 4.124 | 4.197 | 3.471 | 3.792 | 7.482 | -10.061 | -0.387 | 1.849 | 1.818 | -1.246 | -1.261 | 7.475 | 0 |
| 4998 | -1.703 | 0.615 | 6.221 | -0.104 | 0.956 | -3.279 | -1.634 | -0.104 | 1.388 | -1.066 | -7.970 | 2.262 | 3.134 | -0.486 | -3.498 | -4.562 | 3.136 | 2.536 | -0.792 | 4.398 | -4.073 | -0.038 | -2.371 | -1.542 | 2.908 | 3.215 | -0.169 | -1.541 | -4.724 | -5.525 | 1.668 | -4.100 | -5.949 | 0.550 | -1.574 | 6.824 | 2.139 | -4.036 | 3.436 | 0.579 | 0 |
| 4999 | -0.604 | 0.960 | -0.721 | 8.230 | -1.816 | -2.276 | -2.575 | -1.041 | 4.130 | -2.731 | -3.292 | -1.674 | 0.465 | -1.646 | -5.263 | -7.988 | 6.480 | 0.226 | 4.963 | 6.752 | -6.306 | 3.271 | 1.897 | 3.271 | -0.637 | -0.925 | -6.759 | 2.990 | -0.814 | 3.499 | -8.435 | 2.370 | -1.062 | 0.791 | 4.952 | -7.441 | -0.070 | -0.918 | -2.291 | -5.363 | 0 |
Observations
The test data has 5000 observations, taking into account zero indexing in Python.
Checking the data types of the columns for the dataset for both training and testing¶
# Code to check the data types of the columns in the training dataset
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 20000 entries, 0 to 19999 Data columns (total 41 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 V1 19982 non-null float64 1 V2 19982 non-null float64 2 V3 20000 non-null float64 3 V4 20000 non-null float64 4 V5 20000 non-null float64 5 V6 20000 non-null float64 6 V7 20000 non-null float64 7 V8 20000 non-null float64 8 V9 20000 non-null float64 9 V10 20000 non-null float64 10 V11 20000 non-null float64 11 V12 20000 non-null float64 12 V13 20000 non-null float64 13 V14 20000 non-null float64 14 V15 20000 non-null float64 15 V16 20000 non-null float64 16 V17 20000 non-null float64 17 V18 20000 non-null float64 18 V19 20000 non-null float64 19 V20 20000 non-null float64 20 V21 20000 non-null float64 21 V22 20000 non-null float64 22 V23 20000 non-null float64 23 V24 20000 non-null float64 24 V25 20000 non-null float64 25 V26 20000 non-null float64 26 V27 20000 non-null float64 27 V28 20000 non-null float64 28 V29 20000 non-null float64 29 V30 20000 non-null float64 30 V31 20000 non-null float64 31 V32 20000 non-null float64 32 V33 20000 non-null float64 33 V34 20000 non-null float64 34 V35 20000 non-null float64 35 V36 20000 non-null float64 36 V37 20000 non-null float64 37 V38 20000 non-null float64 38 V39 20000 non-null float64 39 V40 20000 non-null float64 40 Target 20000 non-null int64 dtypes: float64(40), int64(1) memory usage: 6.3 MB
Observations Float is 40 predictors. The target variable is one, and it's an integer. Memory usage is 6.3MB. There appears to be some missing values in V1 and V2.
# Code to check the data types of the columns in the test dataset
data_test.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5000 entries, 0 to 4999 Data columns (total 41 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 V1 4995 non-null float64 1 V2 4994 non-null float64 2 V3 5000 non-null float64 3 V4 5000 non-null float64 4 V5 5000 non-null float64 5 V6 5000 non-null float64 6 V7 5000 non-null float64 7 V8 5000 non-null float64 8 V9 5000 non-null float64 9 V10 5000 non-null float64 10 V11 5000 non-null float64 11 V12 5000 non-null float64 12 V13 5000 non-null float64 13 V14 5000 non-null float64 14 V15 5000 non-null float64 15 V16 5000 non-null float64 16 V17 5000 non-null float64 17 V18 5000 non-null float64 18 V19 5000 non-null float64 19 V20 5000 non-null float64 20 V21 5000 non-null float64 21 V22 5000 non-null float64 22 V23 5000 non-null float64 23 V24 5000 non-null float64 24 V25 5000 non-null float64 25 V26 5000 non-null float64 26 V27 5000 non-null float64 27 V28 5000 non-null float64 28 V29 5000 non-null float64 29 V30 5000 non-null float64 30 V31 5000 non-null float64 31 V32 5000 non-null float64 32 V33 5000 non-null float64 33 V34 5000 non-null float64 34 V35 5000 non-null float64 35 V36 5000 non-null float64 36 V37 5000 non-null float64 37 V38 5000 non-null float64 38 V39 5000 non-null float64 39 V40 5000 non-null float64 40 Target 5000 non-null int64 dtypes: float64(40), int64(1) memory usage: 1.6 MB
Observations
Float is 40 predictors. The target variable is one, and it's an integer. Memory usage is 1.6MB. There appears to be some missing values in V1 and V2.
Checking for missing values¶
# Code to check missing value in the training dataset
df_train.isnull().sum()
| 0 | |
|---|---|
| V1 | 18 |
| V2 | 18 |
| V3 | 0 |
| V4 | 0 |
| V5 | 0 |
| V6 | 0 |
| V7 | 0 |
| V8 | 0 |
| V9 | 0 |
| V10 | 0 |
| V11 | 0 |
| V12 | 0 |
| V13 | 0 |
| V14 | 0 |
| V15 | 0 |
| V16 | 0 |
| V17 | 0 |
| V18 | 0 |
| V19 | 0 |
| V20 | 0 |
| V21 | 0 |
| V22 | 0 |
| V23 | 0 |
| V24 | 0 |
| V25 | 0 |
| V26 | 0 |
| V27 | 0 |
| V28 | 0 |
| V29 | 0 |
| V30 | 0 |
| V31 | 0 |
| V32 | 0 |
| V33 | 0 |
| V34 | 0 |
| V35 | 0 |
| V36 | 0 |
| V37 | 0 |
| V38 | 0 |
| V39 | 0 |
| V40 | 0 |
| Target | 0 |
# Code to check missing value in the training dataset
df_test.isnull().sum()
| 0 | |
|---|---|
| V1 | 5 |
| V2 | 6 |
| V3 | 0 |
| V4 | 0 |
| V5 | 0 |
| V6 | 0 |
| V7 | 0 |
| V8 | 0 |
| V9 | 0 |
| V10 | 0 |
| V11 | 0 |
| V12 | 0 |
| V13 | 0 |
| V14 | 0 |
| V15 | 0 |
| V16 | 0 |
| V17 | 0 |
| V18 | 0 |
| V19 | 0 |
| V20 | 0 |
| V21 | 0 |
| V22 | 0 |
| V23 | 0 |
| V24 | 0 |
| V25 | 0 |
| V26 | 0 |
| V27 | 0 |
| V28 | 0 |
| V29 | 0 |
| V30 | 0 |
| V31 | 0 |
| V32 | 0 |
| V33 | 0 |
| V34 | 0 |
| V35 | 0 |
| V36 | 0 |
| V37 | 0 |
| V38 | 0 |
| V39 | 0 |
| V40 | 0 |
| Target | 0 |
Observations
The training dataset has two missing values of 18 and 18 for V1 and V2 respectively.
The testing dataset has two missing values of 5 and 6 for V1 and V2 respectively.
These would have to be treated to ensure the integrity of the data for modeling, but before that, let's check if there are any duplicates in the dataset for both training and testing.
Checking for duplicate values¶
# Code to check for duplicate values in the training dataset
df_train.duplicated().sum()
0
# Code to check for duplicate values in the test dataset
df_test.duplicated().sum()
0
Observations
There are no duplicates in the dataset for both training and testing
Statistical summary of the dataset¶
# Code to check statistical summary of the training dataset
data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| V1 | 19982.000 | -0.272 | 3.442 | -11.876 | -2.737 | -0.748 | 1.840 | 15.493 |
| V2 | 19982.000 | 0.440 | 3.151 | -12.320 | -1.641 | 0.472 | 2.544 | 13.089 |
| V3 | 20000.000 | 2.485 | 3.389 | -10.708 | 0.207 | 2.256 | 4.566 | 17.091 |
| V4 | 20000.000 | -0.083 | 3.432 | -15.082 | -2.348 | -0.135 | 2.131 | 13.236 |
| V5 | 20000.000 | -0.054 | 2.105 | -8.603 | -1.536 | -0.102 | 1.340 | 8.134 |
| V6 | 20000.000 | -0.995 | 2.041 | -10.227 | -2.347 | -1.001 | 0.380 | 6.976 |
| V7 | 20000.000 | -0.879 | 1.762 | -7.950 | -2.031 | -0.917 | 0.224 | 8.006 |
| V8 | 20000.000 | -0.548 | 3.296 | -15.658 | -2.643 | -0.389 | 1.723 | 11.679 |
| V9 | 20000.000 | -0.017 | 2.161 | -8.596 | -1.495 | -0.068 | 1.409 | 8.138 |
| V10 | 20000.000 | -0.013 | 2.193 | -9.854 | -1.411 | 0.101 | 1.477 | 8.108 |
| V11 | 20000.000 | -1.895 | 3.124 | -14.832 | -3.922 | -1.921 | 0.119 | 11.826 |
| V12 | 20000.000 | 1.605 | 2.930 | -12.948 | -0.397 | 1.508 | 3.571 | 15.081 |
| V13 | 20000.000 | 1.580 | 2.875 | -13.228 | -0.224 | 1.637 | 3.460 | 15.420 |
| V14 | 20000.000 | -0.951 | 1.790 | -7.739 | -2.171 | -0.957 | 0.271 | 5.671 |
| V15 | 20000.000 | -2.415 | 3.355 | -16.417 | -4.415 | -2.383 | -0.359 | 12.246 |
| V16 | 20000.000 | -2.925 | 4.222 | -20.374 | -5.634 | -2.683 | -0.095 | 13.583 |
| V17 | 20000.000 | -0.134 | 3.345 | -14.091 | -2.216 | -0.015 | 2.069 | 16.756 |
| V18 | 20000.000 | 1.189 | 2.592 | -11.644 | -0.404 | 0.883 | 2.572 | 13.180 |
| V19 | 20000.000 | 1.182 | 3.397 | -13.492 | -1.050 | 1.279 | 3.493 | 13.238 |
| V20 | 20000.000 | 0.024 | 3.669 | -13.923 | -2.433 | 0.033 | 2.512 | 16.052 |
| V21 | 20000.000 | -3.611 | 3.568 | -17.956 | -5.930 | -3.533 | -1.266 | 13.840 |
| V22 | 20000.000 | 0.952 | 1.652 | -10.122 | -0.118 | 0.975 | 2.026 | 7.410 |
| V23 | 20000.000 | -0.366 | 4.032 | -14.866 | -3.099 | -0.262 | 2.452 | 14.459 |
| V24 | 20000.000 | 1.134 | 3.912 | -16.387 | -1.468 | 0.969 | 3.546 | 17.163 |
| V25 | 20000.000 | -0.002 | 2.017 | -8.228 | -1.365 | 0.025 | 1.397 | 8.223 |
| V26 | 20000.000 | 1.874 | 3.435 | -11.834 | -0.338 | 1.951 | 4.130 | 16.836 |
| V27 | 20000.000 | -0.612 | 4.369 | -14.905 | -3.652 | -0.885 | 2.189 | 17.560 |
| V28 | 20000.000 | -0.883 | 1.918 | -9.269 | -2.171 | -0.891 | 0.376 | 6.528 |
| V29 | 20000.000 | -0.986 | 2.684 | -12.579 | -2.787 | -1.176 | 0.630 | 10.722 |
| V30 | 20000.000 | -0.016 | 3.005 | -14.796 | -1.867 | 0.184 | 2.036 | 12.506 |
| V31 | 20000.000 | 0.487 | 3.461 | -13.723 | -1.818 | 0.490 | 2.731 | 17.255 |
| V32 | 20000.000 | 0.304 | 5.500 | -19.877 | -3.420 | 0.052 | 3.762 | 23.633 |
| V33 | 20000.000 | 0.050 | 3.575 | -16.898 | -2.243 | -0.066 | 2.255 | 16.692 |
| V34 | 20000.000 | -0.463 | 3.184 | -17.985 | -2.137 | -0.255 | 1.437 | 14.358 |
| V35 | 20000.000 | 2.230 | 2.937 | -15.350 | 0.336 | 2.099 | 4.064 | 15.291 |
| V36 | 20000.000 | 1.515 | 3.801 | -14.833 | -0.944 | 1.567 | 3.984 | 19.330 |
| V37 | 20000.000 | 0.011 | 1.788 | -5.478 | -1.256 | -0.128 | 1.176 | 7.467 |
| V38 | 20000.000 | -0.344 | 3.948 | -17.375 | -2.988 | -0.317 | 2.279 | 15.290 |
| V39 | 20000.000 | 0.891 | 1.753 | -6.439 | -0.272 | 0.919 | 2.058 | 7.760 |
| V40 | 20000.000 | -0.876 | 3.012 | -11.024 | -2.940 | -0.921 | 1.120 | 10.654 |
| Target | 20000.000 | 0.056 | 0.229 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
# Calculate descriptive statistics for the training data
desc_stats = data.describe()
# Create a DataFrame to store the desired statistics
max_values_df = pd.DataFrame({
'Feature': desc_stats.columns,
'Max Value': desc_stats.loc['max'].values,
'Mean': desc_stats.loc['mean'].values,
'Median': desc_stats.loc['50%'].values,
'Standard Deviation': desc_stats.loc['std'].values
})
# Sort by 'Max Value' in descending order and select top 5
top_5_max_values = max_values_df.sort_values(by='Max Value', ascending=False).head(5)
# Display the results
print(top_5_max_values)
Feature Max Value Mean Median Standard Deviation 31 V32 23.633 0.304 0.052 5.500 35 V36 19.330 1.515 1.567 3.801 26 V27 17.560 -0.612 -0.885 4.369 30 V31 17.255 0.487 0.490 3.461 23 V24 17.163 1.134 0.969 3.912
Observations
Top 5 maximum values are V32, V36, V27, V31, and V24 with median values of 0.052, 1.567, -0.885, 0.490, 0.969 respectively. We can also observe their mean and standard deviations.
Exploratory Data Analysis (EDA)¶
Univariate Analysis¶
Plotting histograms and boxplots for all the variables¶
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to the show density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a triangle will indicate the mean value of the column
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
) # For histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
Plotting all the features at one go¶
for feature in data.columns:
histogram_boxplot(
data, feature, figsize=(12, 7), kde=False, bins=None
) ## Please change the dataframe name as you define while reading the data
Values in Target variable¶
# Target variable is named 'Target' in the DataFrame 'data'
Target= data['Target'].value_counts()
# Print the distribution
print(Target)
# Access the counts for failed and not failed
failed_count = Target[1]
not_failed_count = Target[0]
# Print the counts
print(f"Failed: {failed_count}")
print(f"Not Failed: {not_failed_count}")
0 18890 1 1110 Name: Target, dtype: int64 Failed: 1110 Not Failed: 18890
# Assuming your test data is in a DataFrame called 'data_test'
Target_test = data_test['Target'].value_counts()
# Print the distribution
print(Target_test)
# Access the counts for failed and not failed
failed_count_test = Target_test[1]
not_failed_count_test = Target_test[0]
# Print the counts
print(f"Failed: {failed_count_test}")
print(f"Not Failed: {not_failed_count_test}")
0 4718 1 282 Name: Target, dtype: int64 Failed: 282 Not Failed: 4718
Observations
A few outliers can be observed from the box plots. If these are incorrectly recorded, they need to be corrected before we analyse the data further. If they belong to another dataset but have been incorrectly included here, they need to be removed. It is clear that the outliers here are true values and are part of the dataset so we will keep them.
The histplot for all the 40 predictor variables follow a normal distribution with some slight skewness, either to the left or to the right.
For the Target variable, we have more observations for Not Failed (18890) than Failed (1110) for the training data, and Not failed (4718) and Failed(282) for the testing data. Not failed are in the majority.
Data Pre-processing¶
# Dividing train data into X and y
X = data.drop(["Target"], axis=1)
y = data["Target"]
# Splitting train dataset into training and validation set
X_train, X_val, y_train, y_val = train_test_split(
X, y, test_size=0.25, random_state=1, stratify=y
)
print(X_train.shape, X_val.shape) ## Complete the code to split the train dataset into train and validation set in the ratio 75:25. For the overall data, the split becomes 60:20:20 for train, validation, and test
(15000, 40) (5000, 40)
# Checking the number of rows and columns in the X_train data
X_train.shape ## Complete the code to view dimensions of the X_train data
# Checking the number of rows and columns in the X_val data
X_val.shape ## Complete the code to view dimensions of the X_val data
(5000, 40)
Observations
The training set has 15000 observations. The validation set has 5000 observations
# Dividing test data into X_test and y_test
X_test = data_test.drop('Target', axis=1) ## Complete the code to drop target variable from test data
y_test = data_test['Target'] ## Complete the code to store target variable in y_test
# Checking the number of rows and columns in the X_test data
X_test.shape ## Complete the code to view dimensions of the X_test data
(5000, 40)
Observations
The test data has 5000 observations, with 40 variables.
Missing value imputation¶
# creating an instance of the imputer to be used
imputer = SimpleImputer(strategy="median")
# Fit and transform the train data
X_train = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns)
# Transform the validation data
X_val = pd.DataFrame(imputer.transform(X_val), columns=X_train.columns) ## Complete the code to impute missing values in X_val without data leakage
# Transform the test data
X_test = pd.DataFrame(imputer.transform(X_test), columns=X_train.columns) ## Complete the code to impute missing values in X_test without data leakage
# Checking that no column has missing values in train, validation, or test sets
X_train.info()
X_val.info()
X_test.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 15000 entries, 0 to 14999 Data columns (total 40 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 V1 15000 non-null float64 1 V2 15000 non-null float64 2 V3 15000 non-null float64 3 V4 15000 non-null float64 4 V5 15000 non-null float64 5 V6 15000 non-null float64 6 V7 15000 non-null float64 7 V8 15000 non-null float64 8 V9 15000 non-null float64 9 V10 15000 non-null float64 10 V11 15000 non-null float64 11 V12 15000 non-null float64 12 V13 15000 non-null float64 13 V14 15000 non-null float64 14 V15 15000 non-null float64 15 V16 15000 non-null float64 16 V17 15000 non-null float64 17 V18 15000 non-null float64 18 V19 15000 non-null float64 19 V20 15000 non-null float64 20 V21 15000 non-null float64 21 V22 15000 non-null float64 22 V23 15000 non-null float64 23 V24 15000 non-null float64 24 V25 15000 non-null float64 25 V26 15000 non-null float64 26 V27 15000 non-null float64 27 V28 15000 non-null float64 28 V29 15000 non-null float64 29 V30 15000 non-null float64 30 V31 15000 non-null float64 31 V32 15000 non-null float64 32 V33 15000 non-null float64 33 V34 15000 non-null float64 34 V35 15000 non-null float64 35 V36 15000 non-null float64 36 V37 15000 non-null float64 37 V38 15000 non-null float64 38 V39 15000 non-null float64 39 V40 15000 non-null float64 dtypes: float64(40) memory usage: 4.6 MB <class 'pandas.core.frame.DataFrame'> RangeIndex: 5000 entries, 0 to 4999 Data columns (total 40 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 V1 5000 non-null float64 1 V2 5000 non-null float64 2 V3 5000 non-null float64 3 V4 5000 non-null float64 4 V5 5000 non-null float64 5 V6 5000 non-null float64 6 V7 5000 non-null float64 7 V8 5000 non-null float64 8 V9 5000 non-null float64 9 V10 5000 non-null float64 10 V11 5000 non-null float64 11 V12 5000 non-null float64 12 V13 5000 non-null float64 13 V14 5000 non-null float64 14 V15 5000 non-null float64 15 V16 5000 non-null float64 16 V17 5000 non-null float64 17 V18 5000 non-null float64 18 V19 5000 non-null float64 19 V20 5000 non-null float64 20 V21 5000 non-null float64 21 V22 5000 non-null float64 22 V23 5000 non-null float64 23 V24 5000 non-null float64 24 V25 5000 non-null float64 25 V26 5000 non-null float64 26 V27 5000 non-null float64 27 V28 5000 non-null float64 28 V29 5000 non-null float64 29 V30 5000 non-null float64 30 V31 5000 non-null float64 31 V32 5000 non-null float64 32 V33 5000 non-null float64 33 V34 5000 non-null float64 34 V35 5000 non-null float64 35 V36 5000 non-null float64 36 V37 5000 non-null float64 37 V38 5000 non-null float64 38 V39 5000 non-null float64 39 V40 5000 non-null float64 dtypes: float64(40) memory usage: 1.5 MB <class 'pandas.core.frame.DataFrame'> RangeIndex: 5000 entries, 0 to 4999 Data columns (total 40 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 V1 5000 non-null float64 1 V2 5000 non-null float64 2 V3 5000 non-null float64 3 V4 5000 non-null float64 4 V5 5000 non-null float64 5 V6 5000 non-null float64 6 V7 5000 non-null float64 7 V8 5000 non-null float64 8 V9 5000 non-null float64 9 V10 5000 non-null float64 10 V11 5000 non-null float64 11 V12 5000 non-null float64 12 V13 5000 non-null float64 13 V14 5000 non-null float64 14 V15 5000 non-null float64 15 V16 5000 non-null float64 16 V17 5000 non-null float64 17 V18 5000 non-null float64 18 V19 5000 non-null float64 19 V20 5000 non-null float64 20 V21 5000 non-null float64 21 V22 5000 non-null float64 22 V23 5000 non-null float64 23 V24 5000 non-null float64 24 V25 5000 non-null float64 25 V26 5000 non-null float64 26 V27 5000 non-null float64 27 V28 5000 non-null float64 28 V29 5000 non-null float64 29 V30 5000 non-null float64 30 V31 5000 non-null float64 31 V32 5000 non-null float64 32 V33 5000 non-null float64 33 V34 5000 non-null float64 34 V35 5000 non-null float64 35 V36 5000 non-null float64 36 V37 5000 non-null float64 37 V38 5000 non-null float64 38 V39 5000 non-null float64 39 V40 5000 non-null float64 dtypes: float64(40) memory usage: 1.5 MB
Observations
Missing values have been treated. We can attest to the readiness of the data for model building.
Model Building¶
Model evaluation criterion¶
The nature of predictions made by the classification model will translate as follows:
- True positives (TP) are failures correctly predicted by the model.
- False negatives (FN) are real failures in a generator where there is no detection by model.
- False positives (FP) are failure detections in a generator where there is no failure.
Which metric to optimize?
- We need to choose the metric which will ensure that the maximum number of generator failures are predicted correctly by the model.
- We would want Recall to be maximized as greater the Recall, the higher the chances of minimizing false negatives.
- We want to minimize false negatives because if a model predicts that a machine will have no failure when there will be a failure, it will increase the maintenance cost.
Let's define a function to output different metrics (including recall) on the train and test set and a function to show confusion matrix so that we do not have to use the same code repetitively while evaluating models.
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"Accuracy": acc,
"Recall": recall,
"Precision": precision,
"F1": f1
},
index=[0],
)
return df_perf
Defining scorer to be used for cross-validation and hyperparameter tuning¶
- We want to reduce false negatives and will try to maximize "Recall".
- To maximize Recall, we can use Recall as a scorer in cross-validation and hyperparameter tuning.
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
Model Building with original data¶
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Decision Tree", DecisionTreeClassifier(random_state=1)))
models.append(("Logistic Regression", LogisticRegression(random_state=1)))
models.append(("Random Forest", RandomForestClassifier(random_state=1)))
models.append(("AdaBoost", AdaBoostClassifier(random_state=1)))
models.append(("Gradient Boost", GradientBoostingClassifier(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
results1 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train, y=y_train, scoring=scorer, cv=kfold
)
results1.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean() * 100))
print("\n" "Validation Performance:" "\n")
for name, model in models:
model.fit(X_train, y_train)
scores = recall_score(y_val, model.predict(X_val)) * 100
print("{}: {}".format(name, scores))
Cross-Validation performance on training dataset: Decision Tree: 69.82829521679533 Logistic Regression: 49.27566553639709 Random Forest: 72.35192266070268 AdaBoost: 63.09140754635308 Gradient Boost: 70.66661857008873 Bagging: 72.1080730106053 Validation Performance: Decision Tree: 70.50359712230215 Logistic Regression: 48.201438848920866 Random Forest: 72.66187050359713 AdaBoost: 67.62589928057554 Gradient Boost: 72.3021582733813 Bagging: 73.02158273381295
Observations
On the cross-validation performance on the training dataset, Adaboost and Logistic regression have the lowest scores of 53.7% and 49.3% respectively. These could indicate that they may not capture the underlying patterns in the dataset. Performance on Decison tree doesn't look bad with 69.8%. Performance on Random Forest looks very good with 72.4%, followed by Bagging with 72.1%, and Gradient boosting of 70.7%.
On the validation performance, Random Forest improved from 72.4% to 72.7, retaining the best score. It generalizes well with new data. Bagging improved from 72.1% to 73.0%. Gradient Boosting also improved from 70.7% to 72.3%. This suggests stability in the model.
Logistic regression has dropped in performance, suggesting it might not be too strong a model for our dataset.
Decision tree has seen an improvement in its performance. It generalizes well from training to validation.
Adaboost has improved slightly.
Random forest happens to be the best model since it has the the best score and performs well in both training and validation. The variability in its performance is also moderate.
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results1)
ax.set_xticklabels(names)
plt.show()
Observations
Gradient Boost and Random Forest give a higher median score, with Random Forest having the highest median performance (~0.73) showing a better performance. While the logistic regression and AdaBoost are far behind. Bagging's performance is consistently high, except for one outlier. Decision Tree has seen some improvement in performance. Bagging has also seen some improvement.
Model Building with Oversampled data¶
print("Before OverSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train == 0)))
# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
print("After OverSampling, counts of label '1': {}".format(sum(y_train_over == 1)))
print("After OverSampling, counts of label '0': {} \n".format(sum(y_train_over == 0)))
print("After OverSampling, the shape of train_X: {}".format(X_train_over.shape))
print("After OverSampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before OverSampling, counts of label '1': 832 Before OverSampling, counts of label '0': 14168 After OverSampling, counts of label '1': 14168 After OverSampling, counts of label '0': 14168 After OverSampling, the shape of train_X: (28336, 40) After OverSampling, the shape of train_y: (28336,)
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Decision Tree Over", DecisionTreeClassifier(random_state=1)))
models.append(("Logistic Regression Over", LogisticRegression(random_state=1)))
models.append(("Random Forest Over", RandomForestClassifier(random_state=1)))
models.append(("AdaBoost Over", AdaBoostClassifier(random_state=1)))
models.append(("Gradient Boost Over", GradientBoostingClassifier(random_state=1)))
models.append(("Bagging Over", BaggingClassifier(random_state=1)))
results1 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation Cost:" "\n")
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train_over, y=y_train_over, scoring=scorer, cv=kfold
)
results1.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean() * 100))
print("\n" "Validation Performance:" "\n")
for name, model in models:
model.fit(X_train_over, y_train_over)
scores = recall_score(y_val, model.predict(X_val)) * 100
print("{}: {}".format(name, scores))
Cross-Validation Cost: Decision Tree Over: 97.20494245534968 Logistic Regression Over: 88.3963699328486 Random Forest Over: 98.39075260047615 AdaBoost Over: 89.78689011775472 Gradient Boost Over: 92.56068151319724 Bagging Over: 97.62141471581656 Validation Performance: Decision Tree Over: 77.6978417266187 Logistic Regression Over: 84.89208633093526 Random Forest Over: 84.89208633093526 AdaBoost Over: 85.61151079136691 Gradient Boost Over: 87.76978417266187 Bagging Over: 83.45323741007195
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results1)
plt.xticks(rotation=45)
ax.set_xticklabels(names)
plt.show()
Observations
Random Forest performs exceptionally well with highest median performance (~0.98)
Bagging and Decision Tree also show strong performance: Both around 0.97 median.
Note that Bagging has one outlier. Decision Tree shows consistent performance Gradient Boost shows lower performance (~0.92) with some outliers, which is unusual as it typically performs similarly to Random Forest.
Logistic Regression has the lowest median (~0.88) but shows consistent results.
Model Building with Undersampled data¶
# Random undersampler for under sampling the data
print('Before Undersampling, counts of label "Yes" : {}'.format(sum(y_train == 1)))
print("Before Undersampling, counts of labels 'No': {}\n".format(sum(y_train == 0)))
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
print("After Undersampling, counts of label 'Yes' : {}".format(sum(y_train_un == 1)))
print("After Undersampling, counts of label 'No' : {}".format(sum(y_train_un == 0)))
print("After Undersampling, the shape of train_X: {}".format(X_train_un.shape))
print("After Undersampling, the shape of train_y: {} \n".format(y_train_un.shape))
Before Undersampling, counts of label "Yes" : 832 Before Undersampling, counts of labels 'No': 14168 After Undersampling, counts of label 'Yes' : 832 After Undersampling, counts of label 'No' : 832 After Undersampling, the shape of train_X: (1664, 40) After Undersampling, the shape of train_y: (1664,)
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Decision Tree Under", DecisionTreeClassifier(random_state=1)))
models.append(("Logistic Regression Under", LogisticRegression(random_state=1)))
models.append(("Random Forest Under", RandomForestClassifier(random_state=1)))
models.append(("AdaBoost Under", AdaBoostClassifier(random_state=1)))
models.append(("Gradient Boost Under", GradientBoostingClassifier(random_state=1)))
models.append(("Bagging Under", BaggingClassifier(random_state=1)))
results2 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation Cost:" "\n")
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train_un, y=y_train_un, scoring=scorer, cv=kfold
)
results2.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean() * 100))
print("\n" "Validation Performance:" "\n")
for name, model in models:
model.fit(X_train_un, y_train_un)
scores = recall_score(y_val, model.predict(X_val)) * 100
print("{}: {}".format(name, scores))
Cross-Validation Cost: Decision Tree Under: 86.17776495202367 Logistic Regression Under: 87.26138085275232 Random Forest Under: 90.38669648654498 AdaBoost Under: 86.6611355602049 Gradient Boost Under: 89.90621167303946 Bagging Under: 86.41945025611427 Validation Performance: Decision Tree Under: 84.17266187050359 Logistic Regression Under: 85.25179856115108 Random Forest Under: 89.20863309352518 AdaBoost Under: 84.89208633093526 Gradient Boost Under: 88.84892086330936 Bagging Under: 87.05035971223022
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results2)
plt.xticks(rotation=45)
ax.set_xticklabels(names)
plt.show()
Observations
Random Forest stands out with the best overall performance, showing the highest median score around 0.91.
Gradient Boost follows with good and very consistent performance though it has one outlier.
Decision Tree and Bagging show lower performance, with medians around 0.86-0.87
AdaBoost shows moderate performance
HyperparameterTuning¶
The models that performed extremely well in the original, oversampled and undersampled data will be used. Overfitting or underfitting models will be taken into account. In this particular case, I'll consider the decision tree model as part of my top 3 for its simplicity and interpretability.
Based on the training and validation performance, the 3 best-performing models from each dataset are:
Original Data:
- Bagging - Validation: 73.02%
- Random Forest - Validation: 72.66%
- Gradient Boost - Validation: 72.30%
Oversampled Data:
- Random Forest Over - Validation: 84.89%
- Gradient Boost Over - Validation: 87.77%
- AdaBoost Over - Validation: 85.97%
Undersampled Data:
- Random Forest Under - Validation: 89.21%
- Gradient Boost Under - Validation: 88.85%
- Bagging Under - Validation: 87.05%
Final Recommendation: My top 3 best performing models are based on performance across original, oversampled, and undersampled data, along with their robustness and generalization ability, simplicity and interpretability.
These are the reasons for my choice of model:
- Random Forest
- Best performer across all datasets
- Strong generalization, handles overfitting well
- Suitable for high-dimensional data
- Gradient Boosting
- Consistently high validation accuracy
- Works well with imbalanced data (especially oversampled)
- Captures complex patterns effectively
3.Deciion Tree
- Easy to understand and interpret
- Requires less memory and processing power compared to Random Forest or Gradient Boosting
- Serve as a benchmark to compare more advanced models
The Hyperparameter tuning will be done at the following levels:
- Model Building with Original data (3 algorithms- Random Forest, Gradient Boosting, and Decision Tree)
- Model Building with Oversampled data (3 algorithms- Random Forest, Gradient Boosting, and Decision Tree)
- Model Buidling with Undersampled data (3 algorithms- Random Forest, Gradient Boosting, and Decision Tree)
Sample Parameter Grids¶
Hyperparameter tuning can take a long time to run, so to avoid that time complexity - you can use the following grids, wherever required.
- For Gradient Boosting:
param_grid = { "n_estimators": np.arange(100,150,25), "learning_rate": [0.2, 0.05, 1], "subsample":[0.5,0.7], "max_features":[0.5,0.7] }
- For Adaboost:
param_grid = { "n_estimators": [100, 150, 200], "learning_rate": [0.2, 0.05], "base_estimator": [DecisionTreeClassifier(max_depth=1, random_state=1), DecisionTreeClassifier(max_depth=2, random_state=1), DecisionTreeClassifier(max_depth=3, random_state=1), ] }
- For Bagging Classifier:
param_grid = { 'max_samples': [0.8,0.9,1], 'max_features': [0.7,0.8,0.9], 'n_estimators' : [30,50,70], }
- For Random Forest:
param_grid = { "n_estimators": [200,250,300], "min_samples_leaf": np.arange(1, 4), "max_features": [np.arange(0.3, 0.6, 0.1),'sqrt'], "max_samples": np.arange(0.4, 0.7, 0.1) }
- For Decision Trees:
param_grid = { 'max_depth': np.arange(2,6), 'min_samples_leaf': [1, 4, 7], 'max_leaf_nodes' : [10, 15], 'min_impurity_decrease': [0.0001,0.001] }
- For Logistic Regression:
param_grid = {'C': np.arange(0.1,1.1,0.1)}
- For XGBoost:
param_grid={ 'n_estimators': [150, 200, 250], 'scale_pos_weight': [5,10], 'learning_rate': [0.1,0.2], 'gamma': [0,3,5], 'subsample': [0.8,0.9] }
Tuning method for Random Forest with original data¶
# defining model
Model = RandomForestClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {
"n_estimators": [200,250,300],
"min_samples_leaf": np.arange(1, 4),
"max_features": [np.arange(0.3, 0.6, 0.1),'sqrt'],
"max_samples": np.arange(0.4, 0.7, 0.1)
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 250, 'min_samples_leaf': 1, 'max_samples': 0.6, 'max_features': 'sqrt'} with CV score=0.6996248466921577:
#Training performance
tuned_rf = RandomForestClassifier(
n_estimators=250,
min_samples_leaf=1,
max_samples=0.6,
max_features="sqrt",
random_state=1,
)
tuned_rf.fit(X_train, y_train)
RandomForestClassifier(max_samples=0.6, n_estimators=250, random_state=1)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(max_samples=0.6, n_estimators=250, random_state=1)
p_tuned_rf = model_performance_classification_sklearn(
tuned_rf, X_train, y_train
)
p_tuned_rf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.995 | 0.909 | 1.000 | 0.952 |
# Validation performance
tuned_rf_val = RandomForestClassifier(
n_estimators=250,
min_samples_leaf=1,
max_samples=0.6,
max_features="sqrt",
random_state=1,
)
tuned_rf_val.fit(X_val, y_val)
RandomForestClassifier(max_samples=0.6, n_estimators=250, random_state=1)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(max_samples=0.6, n_estimators=250, random_state=1)
p_tuned_rf_val = model_performance_classification_sklearn(tuned_rf_val, X_val, y_val)
p_tuned_rf_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.993 | 0.867 | 1.000 | 0.929 |
def confusion_matrix_sklearn(model, X, y):
y_pred = model.predict(X)
cm = confusion_matrix(y, y_pred)
cm_percent = confusion_matrix(y, y_pred, normalize='true')
annot = np.char.add(cm.astype(str), np.char.add("\n(",np.char.add(np.round(cm_percent*100, 2).astype(str), np.char.add("%", ")"))))
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(cm, annot=annot, fmt='', cmap='Blues', ax=ax)
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
plt.show()
return cm # return the confusion matrix object
p_tuned_rf_val_cm = confusion_matrix_sklearn(tuned_rf_val, X_val, y_val)
Tuning method for Random Forest with oversampled data¶
# defining model
Model = RandomForestClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {
"n_estimators": [200, 250, 300],
"min_samples_leaf": np.arange(1, 4),
"max_features": [np.arange(0.3, 0.6, 0.1), 'sqrt'],
"max_samples": np.arange(0.4, 0.7, 0.1)
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 300, 'min_samples_leaf': 1, 'max_samples': 0.6, 'max_features': 'sqrt'} with CV score=0.9815078165615898:
tuned_over_rf = RandomForestClassifier(
n_estimators=250,
min_samples_leaf=1,
max_samples=0.6,
max_features="sqrt",
random_state=1,
)
tuned_over_rf.fit(X_train_over, y_train_over)
RandomForestClassifier(max_samples=0.6, n_estimators=250, random_state=1)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(max_samples=0.6, n_estimators=250, random_state=1)
p_tuned_over_rf = model_performance_classification_sklearn(
tuned_over_rf, X_train_over, y_train_over
)
p_tuned_over_rf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.000 | 0.999 | 1.000 | 1.000 |
tuned_over_rf_val = RandomForestClassifier(
n_estimators=250,
min_samples_leaf=1,
max_samples=0.6,
max_features="sqrt",
random_state=1,
)
tuned_over_rf_val.fit(X_val, y_val)
RandomForestClassifier(max_samples=0.6, n_estimators=250, random_state=1)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(max_samples=0.6, n_estimators=250, random_state=1)
p_tuned_over_rf_val = model_performance_classification_sklearn(tuned_over_rf_val, X_val, y_val)
p_tuned_over_rf_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.993 | 0.867 | 1.000 | 0.929 |
def confusion_matrix_sklearn(model, X, y):
y_pred = model.predict(X)
cm = confusion_matrix(y, y_pred)
cm_percent = confusion_matrix(y, y_pred, normalize='true')
annot = np.char.add(cm.astype(str), np.char.add("\n(",np.char.add(np.round(cm_percent*100, 2).astype(str), np.char.add("%", ")"))))
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(cm, annot=annot, fmt='', cmap='Blues', ax=ax)
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
plt.show()
return cm # return the confusion matrix object
p_tuned_over_rf_val_cm = confusion_matrix_sklearn(tuned_over_rf_val, X_val, y_val)
p_tuned_over_rf_val_cm
array([[4722, 0],
[ 37, 241]])
Tuning method for Random Forest with undersampled data¶
# defining model
Model = RandomForestClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {
"n_estimators": [200, 250, 300],
"min_samples_leaf": np.arange(1, 4),
"max_features": [np.arange(0.3, 0.6, 0.1), 'sqrt'],
"max_samples": np.arange(0.4, 0.7, 0.1)
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 250, 'min_samples_leaf': 1, 'max_samples': 0.6, 'max_features': 'sqrt'} with CV score=0.8978140105331505:
tuned_un_rf = RandomForestClassifier(
n_estimators=250,
min_samples_leaf=1,
max_samples=0.6,
max_features="sqrt",
random_state=1,
)
tuned_un_rf.fit(X_train_un, y_train_un)
RandomForestClassifier(max_samples=0.6, n_estimators=250, random_state=1)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(max_samples=0.6, n_estimators=250, random_state=1)
p_tuned_un_rf = model_performance_classification_sklearn(
tuned_un_rf, X_train_un, y_train_un
)
p_tuned_un_rf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.988 | 0.977 | 0.999 | 0.988 |
tuned_un_rf_val = RandomForestClassifier(
n_estimators=250,
min_samples_leaf=1,
max_samples=0.6,
max_features="sqrt",
random_state=1,
)
tuned_un_rf_val.fit(X_val, y_val)
RandomForestClassifier(max_samples=0.6, n_estimators=250, random_state=1)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(max_samples=0.6, n_estimators=250, random_state=1)
p_tuned_un_rf_val = model_performance_classification_sklearn(tuned_un_rf, X_val, y_val)
p_tuned_un_rf_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.944 | 0.885 | 0.496 | 0.636 |
def confusion_matrix_sklearn(model, X, y):
y_pred = model.predict(X)
cm = confusion_matrix(y, y_pred)
cm_percent = confusion_matrix(y, y_pred, normalize='true')
annot = np.char.add(cm.astype(str), np.char.add("\n(",np.char.add(np.round(cm_percent*100, 2).astype(str), np.char.add("%", ")"))))
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(cm, annot=annot, fmt='', cmap='Blues', ax=ax)
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
plt.show()
return cm # return the confusion matrix object
p_tuned_un_rf_val_cm = confusion_matrix_sklearn(tuned_un_rf, X_val, y_val)
p_tuned_un_rf_val_cm
array([[4472, 250],
[ 32, 246]])
Tuning method for Gradient Boosting with original data¶
# Define the model
Model2 = GradientBoostingClassifier(random_state=1)
# Parameter grid for RandomSearchCV
param_grid = {
"n_estimators": np.arange(100, 150, 25),
"learning_rate": [0.2, 0.05, 1],
"subsample": [0.5, 0.7],
"max_features": [0.5, 0.7],
}
# RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
estimator=Model2,
param_distributions=param_grid,
n_iter=10,
n_jobs=-1,
scoring=scorer, # Assuming 'scorer' is defined
cv=5,
random_state=1,
)
# Fit the model
randomized_cv.fit(X_train, y_train) # Using original data
print(
"Best parameters are {} with CV score={}:".format(
randomized_cv.best_params_, randomized_cv.best_score_
)
)
Best parameters are {'subsample': 0.7, 'n_estimators': 125, 'max_features': 0.5, 'learning_rate': 0.2} with CV score=0.754895029218671:
tuned_gb = GradientBoostingClassifier(
subsample=0.7, n_estimators=125, max_features=0.5, learning_rate=1, random_state=1
)
tuned_gb.fit(X_train, y_train)
GradientBoostingClassifier(learning_rate=1, max_features=0.5, n_estimators=125,
random_state=1, subsample=0.7)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GradientBoostingClassifier(learning_rate=1, max_features=0.5, n_estimators=125,
random_state=1, subsample=0.7)p_tuned_gb = model_performance_classification_sklearn(
tuned_gb, X_train, y_train
)
p_tuned_gb
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.963 | 0.600 | 0.697 | 0.645 |
tuned_gb_val = GradientBoostingClassifier(
subsample=0.7, n_estimators=125, max_features=0.5, learning_rate=1, random_state=1
)
tuned_gb_val.fit(X_val, y_val)
GradientBoostingClassifier(learning_rate=1, max_features=0.5, n_estimators=125,
random_state=1, subsample=0.7)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GradientBoostingClassifier(learning_rate=1, max_features=0.5, n_estimators=125,
random_state=1, subsample=0.7)p_tuned_gb_val = model_performance_classification_sklearn(
tuned_gb, X_val, y_val
)
p_tuned_gb_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.953 | 0.507 | 0.588 | 0.544 |
p_tuned_gb_val_cm = confusion_matrix_sklearn(tuned_gb, X_val, y_val)
p_tuned_gb_val_cm
array([[4623, 99],
[ 137, 141]])
Tuning method for Gradient Boosting with oversampled data¶
# defining model
Model2 = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = param_grid = {
"n_estimators": np.arange(100, 150, 25),
"learning_rate": [0.2, 0.05, 1],
"subsample": [0.5, 0.7],
"max_features": [0.5, 0.7],
}
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
estimator=Model2,
param_distributions=param_grid,
n_iter=10,
n_jobs=-1,
scoring=scorer,
cv=5,
random_state=1,
)
# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over, y_train_over)
print(
"Best parameters are {} with CV score={}:".format(
randomized_cv.best_params_, randomized_cv.best_score_
)
)
Best parameters are {'subsample': 0.7, 'n_estimators': 125, 'max_features': 0.5, 'learning_rate': 1} with CV score=0.9723322092856124:
tuned_over_gb = GradientBoostingClassifier(
subsample=0.7, n_estimators=125, max_features=0.5, learning_rate=1, random_state=1
)
tuned_over_gb.fit(X_train_over, y_train_over)
GradientBoostingClassifier(learning_rate=1, max_features=0.5, n_estimators=125,
random_state=1, subsample=0.7)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GradientBoostingClassifier(learning_rate=1, max_features=0.5, n_estimators=125,
random_state=1, subsample=0.7)p_tuned_over_gb = model_performance_classification_sklearn(
tuned_over_gb, X_train_over, y_train_over
)
p_tuned_over_gb
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.993 | 0.992 | 0.994 | 0.993 |
tuned_over_gb_val = GradientBoostingClassifier(
subsample=0.7, n_estimators=125, max_features=0.5, learning_rate=1, random_state=1
)
tuned_over_gb_val.fit(X_val, y_val)
GradientBoostingClassifier(learning_rate=1, max_features=0.5, n_estimators=125,
random_state=1, subsample=0.7)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GradientBoostingClassifier(learning_rate=1, max_features=0.5, n_estimators=125,
random_state=1, subsample=0.7)p_tuned_over_gb_val = model_performance_classification_sklearn(
tuned_over_gb, X_val, y_val
)
p_tuned_over_gb_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.969 | 0.856 | 0.678 | 0.757 |
p_tuned_over_gb_val_cm = confusion_matrix_sklearn(tuned_over_gb, X_val, y_val)
p_tuned_over_gb_val_cm
array([[4609, 113],
[ 40, 238]])
Tuning method for Gradient Boosting with Undersampled data¶
# Undersample the training data
rus = RandomUnderSampler(random_state=1)
X_train_under, y_train_under = rus.fit_resample(X_train, y_train)
# Define the model
Model2 = GradientBoostingClassifier(random_state=1)
# Parameter grid for RandomSearchCV
param_grid = {
"n_estimators": np.arange(100, 150, 25),
"learning_rate": [0.2, 0.05, 1],
"subsample": [0.5, 0.7],
"max_features": [0.5, 0.7],
}
# RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
estimator=Model2,
param_distributions=param_grid,
n_iter=10,
n_jobs=-1,
scoring=scorer, # Assuming 'scorer' is defined
cv=5,
random_state=1,
)
# Fit the model
randomized_cv.fit(X_train_under, y_train_under)
print(
"Best parameters are {} with CV score={}:".format(
randomized_cv.best_params_, randomized_cv.best_score_
)
)
Best parameters are {'subsample': 0.5, 'n_estimators': 100, 'max_features': 0.7, 'learning_rate': 0.2} with CV score=0.9014212538777866:
tuned_un_gb = GradientBoostingClassifier(
subsample=0.7, n_estimators=125, max_features=0.5, learning_rate=1, random_state=1
)
tuned_un_gb.fit(X_train_un, y_train_un)
GradientBoostingClassifier(learning_rate=1, max_features=0.5, n_estimators=125,
random_state=1, subsample=0.7)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GradientBoostingClassifier(learning_rate=1, max_features=0.5, n_estimators=125,
random_state=1, subsample=0.7)p_tuned_un_gb = model_performance_classification_sklearn(
tuned_un_gb, X_train_un, y_train_un
)
p_tuned_un_gb
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.985 | 0.981 | 0.989 | 0.985 |
tuned_un_gb_val = GradientBoostingClassifier(
subsample=0.7, n_estimators=125, max_features=0.5, learning_rate=1, random_state=1
)
tuned_un_gb_val.fit(X_val, y_val)
GradientBoostingClassifier(learning_rate=1, max_features=0.5, n_estimators=125,
random_state=1, subsample=0.7)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GradientBoostingClassifier(learning_rate=1, max_features=0.5, n_estimators=125,
random_state=1, subsample=0.7)p_tuned_un_gb_val = model_performance_classification_sklearn(
tuned_un_gb, X_val, y_val
)
p_tuned_un_gb_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.879 | 0.874 | 0.299 | 0.446 |
p_tuned_un_gb_val_cm = confusion_matrix_sklearn(tuned_un_gb, X_val, y_val)
p_tuned_un_gb_val_cm
array([[4153, 569],
[ 35, 243]])
Tuning method for Decision tree with original data¶
# defining model
Model = DecisionTreeClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {'max_depth': np.arange(2,6),
'min_samples_leaf': [1, 4, 7],
'max_leaf_nodes' : [10,15],
'min_impurity_decrease': [0.0001,0.001] }
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'min_samples_leaf': 7, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 15, 'max_depth': 5} with CV score=0.5684366207344347:
tuned_dt = DecisionTreeClassifier(
max_features=0.5, # Use the same max_features as tuned_gb
random_state=1 # Use the same random_state as tuned_gb
)
# Fit the DecisionTreeClassifier to the training data
tuned_dt.fit(X_train, y_train)
DecisionTreeClassifier(max_features=0.5, random_state=1)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(max_features=0.5, random_state=1)
p_tuned_dt = model_performance_classification_sklearn(
tuned_dt, X_train, y_train
)
p_tuned_dt
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.000 | 1.000 | 1.000 | 1.000 |
# Create a DecisionTreeClassifier with similar hyperparameters
tuned_dt_val = DecisionTreeClassifier(
max_features=0.5, # Use the same max_features as tuned_gb_val
random_state=1 # Use the same random_state as tuned_gb_val
)
# Fit the DecisionTreeClassifier to the validation data
tuned_dt_val.fit(X_val, y_val)
DecisionTreeClassifier(max_features=0.5, random_state=1)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(max_features=0.5, random_state=1)
p_tuned_dt_val = model_performance_classification_sklearn(
tuned_dt, X_val, y_val
)
p_tuned_dt_val
p_tuned_dt_val_cm = confusion_matrix_sklearn(tuned_dt, X_val, y_val)
p_tuned_dt_val_cm
array([[4649, 73],
[ 80, 198]])
Tuning method for Decision tree with oversampled data¶
# defining model
Model = DecisionTreeClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {'max_depth': np.arange(2,6),
'min_samples_leaf': [1, 4, 7],
'max_leaf_nodes' : [10,15],
'min_impurity_decrease': [0.0001,0.001] }
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'min_samples_leaf': 7, 'min_impurity_decrease': 0.001, 'max_leaf_nodes': 15, 'max_depth': 3} with CV score=0.9102913265648006:
tuned_over_dt = DecisionTreeClassifier(
max_features=0.5, # Use the same max_features as tuned_gb
random_state=1 # Use the same random_state as tuned_gb
)
# Fit the DecisionTreeClassifier to the training data
tuned_over_dt.fit(X_train, y_train)
DecisionTreeClassifier(max_features=0.5, random_state=1)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(max_features=0.5, random_state=1)
p_tuned_over_dt = model_performance_classification_sklearn(
tuned_over_dt, X_train, y_train
)
p_tuned_over_dt
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.000 | 1.000 | 1.000 | 1.000 |
# Create a DecisionTreeClassifier with similar hyperparameters
tuned_over_dt_val = DecisionTreeClassifier(
max_features=0.5, # Use the same max_features as tuned_gb_val
random_state=1 # Use the same random_state as tuned_gb_val
)
# Fit the DecisionTreeClassifier to the validation data
tuned_over_dt_val.fit(X_val, y_val)
DecisionTreeClassifier(max_features=0.5, random_state=1)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(max_features=0.5, random_state=1)
p_tuned_over_dt_val = model_performance_classification_sklearn(
tuned_over_dt, X_val, y_val
)
p_tuned_over_dt_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.969 | 0.712 | 0.731 | 0.721 |
p_tuned_over_dt_val_cm = confusion_matrix_sklearn(tuned_over_dt, X_val, y_val)
p_tuned_over_dt_val_cm
array([[4649, 73],
[ 80, 198]])
Tuning method for Decision tree with undersampled data¶
# defining model
Model = DecisionTreeClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {'max_depth': np.arange(2,20),
'min_samples_leaf': [1, 2, 5, 7],
'max_leaf_nodes' : [5, 10,15],
'min_impurity_decrease': [0.0001,0.001] }
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'min_samples_leaf': 1, 'min_impurity_decrease': 0.001, 'max_leaf_nodes': 5, 'max_depth': 2} with CV score=0.850811629752543:
tuned_un_dt = DecisionTreeClassifier(
max_features=0.5, # Use the same max_features as tuned_gb
random_state=1 # Use the same random_state as tuned_gb
)
# Fit the DecisionTreeClassifier to the training data
tuned_un_dt.fit(X_train, y_train)
DecisionTreeClassifier(max_features=0.5, random_state=1)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(max_features=0.5, random_state=1)
p_tuned_un_dt = model_performance_classification_sklearn(
tuned_un_dt, X_train, y_train
)
p_tuned_un_dt
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.000 | 1.000 | 1.000 | 1.000 |
# Create a DecisionTreeClassifier with similar hyperparameters
tuned_un_dt_val = DecisionTreeClassifier(
max_features=0.5, # Use the same max_features as tuned_gb_val
random_state=1 # Use the same random_state as tuned_gb_val
)
# Fit the DecisionTreeClassifier to the validation data
tuned_un_dt_val.fit(X_val, y_val)
DecisionTreeClassifier(max_features=0.5, random_state=1)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(max_features=0.5, random_state=1)
p_tuned_un_dt_val = model_performance_classification_sklearn(
tuned_un_dt, X_val, y_val
)
p_tuned_un_dt_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.969 | 0.712 | 0.731 | 0.721 |
p_tuned_un_dt_val_cm = confusion_matrix_sklearn(tuned_un_dt, X_val, y_val)
p_tuned_un_dt_val_cm
array([[4649, 73],
[ 80, 198]])
Model performance comparison and choosing the final model¶
# training performance comparison
models_train_comp_df = pd.concat(
[
p_tuned_un_gb.T,
p_tuned_over_gb.T,
p_tuned_un_rf.T,
p_tuned_over_rf.T,
p_tuned_over_dt.T,
p_tuned_un_dt.T,
],
axis=1,
)
models_train_comp_df.columns = [
"GB Under Sampled with Random Search",
"Random Forest Under Sampled with Random Search",
"GB Over Sampled with Random Search",
"Random Forest Over Sampled with Random search",
"DT Under Sampled with Random Search",
"DT Over Sampled with Random Search",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| GB Under Sampled with Random Search | Random Forest Under Sampled with Random Search | GB Over Sampled with Random Search | Random Forest Over Sampled with Random search | DT Under Sampled with Random Search | DT Over Sampled with Random Search | |
|---|---|---|---|---|---|---|
| Accuracy | 0.985 | 0.993 | 0.988 | 1.000 | 1.000 | 1.000 |
| Recall | 0.981 | 0.992 | 0.977 | 0.999 | 1.000 | 1.000 |
| Precision | 0.989 | 0.994 | 0.999 | 1.000 | 1.000 | 1.000 |
| F1 | 0.985 | 0.993 | 0.988 | 1.000 | 1.000 | 1.000 |
# validation performance comparison
models_train_comp_df = pd.concat(
[
p_tuned_un_gb_val.T,
p_tuned_over_gb_val.T,
p_tuned_un_rf_val.T,
p_tuned_over_rf_val.T,
p_tuned_over_dt_val.T,
p_tuned_un_dt_val.T,
],
axis=1,
)
models_train_comp_df.columns = [
"GB Under Sampled with Random Search",
"Random Forest Under Sampled with Random Search",
"GB Over Sampled with Random Search",
"Random Forest Over Sampled with Random search",
"DT Under Sampled with Random Search",
"DT Over Sampled with Random Search",
]
print("Validation performance comparison:")
models_train_comp_df
Validation performance comparison:
| GB Under Sampled with Random Search | Random Forest Under Sampled with Random Search | GB Over Sampled with Random Search | Random Forest Over Sampled with Random search | DT Under Sampled with Random Search | DT Over Sampled with Random Search | |
|---|---|---|---|---|---|---|
| Accuracy | 0.879 | 0.969 | 0.944 | 0.993 | 0.969 | 0.969 |
| Recall | 0.874 | 0.856 | 0.885 | 0.867 | 0.712 | 0.712 |
| Precision | 0.299 | 0.678 | 0.496 | 1.000 | 0.731 | 0.731 |
| F1 | 0.446 | 0.757 | 0.636 | 0.929 | 0.721 | 0.721 |
Observations
Gradient Boosting oversampled data has a recall value of 0.885.
Random Forest oversampled has a recall value of 0.867
Decision Tree did not provide any change.
Let's check the test data Gradient Boosting Oversampled, Random Forest oversampled, and decision tree oversampled.
Test set final performance¶
Gradient Boosting¶
tuned_over_gb_test = model_performance_classification_sklearn(
tuned_over_gb, X_test, y_test
)
print("Test Performance")
tuned_over_gb_test
Test Performance
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.965 | 0.840 | 0.648 | 0.731 |
tuned_over_gb_test = confusion_matrix_sklearn(tuned_over_gb, X_test, y_test)
tuned_over_gb_test
array([[4589, 129],
[ 45, 237]])
feature_names = X.columns
importances = tuned_over_gb.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="green", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Random Forest¶
tuned_over_rf_test = model_performance_classification_sklearn(tuned_over_rf, X_test, y_test)
print("Test Performance")
tuned_over_rf_test
Test Performance
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.987 | 0.840 | 0.929 | 0.883 |
tuned_over_rf_test = confusion_matrix_sklearn(tuned_over_rf, X_test, y_test)
tuned_over_rf_test
array([[4700, 18],
[ 45, 237]])
feature_names = X.columns
importances = tuned_over_rf.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="green", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Decision Tree¶
tuned_over_dt_test = model_performance_classification_sklearn(
tuned_over_dt, X_test, y_test
)
print("Test Performance")
tuned_over_dt_test
Test Performance
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.968 | 0.702 | 0.717 | 0.710 |
tuned_over_dt_test = confusion_matrix_sklearn(tuned_over_dt, X_test, y_test)
tuned_over_dt_test
array([[4640, 78],
[ 84, 198]])
feature_names = X.columns
importances = tuned_over_dt.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="green", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Observations
The best performing, with a very high accuracy, robustness to overfitting, stability in feature importance, and efficiency for different sampling techniques will be the Random Forest. It is followed closely by Gradient Boosting but requires extensive tuning. Decision Trees are simple and maybe overfitting.
Pipelines to build the final model¶
# Def
rf_best = RandomForestClassifier(random_state=42)
# Create the pipeline with the trained model
Pipeline_model = Pipeline(steps=[('rf_best', rf_best)]) # Change 'stages' to 'steps'
# Separating target variable and other variables
X1 = data.drop(columns="Target")
Y1 = data["Target"]
# Since we already have a separate test set, we don't need to divide data into train and test
X_test1 = df_test.drop(columns="Target") ## Complete the code to drop target variable from test data
y_test1 = df_test["Target"] ## Complete the code to store target variable in y_test1
# We can't oversample/undersample data without doing missing value treatment, so let's first treat the missing values in the train set
imputer = SimpleImputer(strategy="median")
X1 = imputer.fit_transform(X1)
# We don't need to impute missing values in test set as it will be done inside pipeline
Note: Please perform either oversampling based on the final model chosen.
# #code for oversampling on the data
# # Synthetic Minority Over Sampling Technique
# sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
# X_over1, y_over1 = sm.fit_resample(X1, Y1)
Pipeline_model.fit(X1, Y1) # Changed code to fit the Model obtained from above step using both features and target
Pipeline(steps=[('rf_best', RandomForestClassifier(random_state=42))])In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Pipeline(steps=[('rf_best', RandomForestClassifier(random_state=42))])RandomForestClassifier(random_state=42)
# Assuming 'imputer' is the SimpleImputer instance you created earlier
X_test1 = imputer.transform(X_test1) # Impute missing values in the test set
Pipeline_model_test = recall_score(y_test1, Pipeline_model.predict(X_test1))
Pipeline_model_test
0.723404255319149
Pipeline_model_test = accuracy_score(y_test1, Pipeline_model.predict(X_test1)) ## Complete the code to check the performance on test set
Pipeline_model_test
0.9838
Business Insights and Conclusions¶
Random Forest oversampling is performing very well.
The following features have the most impact in determining whether the wind will fail or not: V36, V18, V39, V15
Focus on improving features that have the most impact in order to prevent failures, thereby increasing repair costs.
The company should continue to use predictive models that demonstrate a cause and effect relationships. In order to be able to predict outcomes, it is important to measure and monitor what drivers most likely cause the outcomes to occur.
The predictive model should be relevant, reliable, and timely for decision makers.
Data integrity is key. Renewind should have the ability to establish data standards and data quality practices.
Integrate predictive analytics into Renewind's management processes.